Before you start, create an Rproject for HW1 as always.

1 Overview

This is a fast-paced course that covers a lot of material. There will be a large number of references. You may need to do your own research to fill in the gaps between lectures and homework/projects. It is impossible to learn data science without getting your hands dirty. Please budget your time evenly. A last-minute work ethic will not work for this course.

Homework in this course is different from your usual homework assignment as a typical student. Most of the time, they are built over real case studies. While you will be applying methods covered in lectures, you will also find that extra teaching materials appear here. The focus will be always on the goals of the study, the usefulness of the data gathered, and the limitations in any conclusions you may draw. Always try to challenge your data analysis in a critical way. Frequently, there are no unique solutions.

Some case studies in each homework can be listed as your data science projects (e.g. on your CV) where you see fit.

1.1 Objectives

  • Get familiar with R-studio and RMarkdown
  • Hands-on R
  • Learn data science essentials
    • gather data
    • clean data
    • summarize data
    • display data
    • conclusion
  • Packages
    • dplyr
    • ggplot
    • sf

Handy cheat sheets

1.2 Instructions

  • Homework assignments can be done in a group consisting of up to two members. Please find your group members as soon as possible and register your group on our Canvas site.

  • All work submitted should be completed in the R Markdown format. You can find a cheat sheet for R Markdown here For those who have never used it before, we urge you to start this homework as soon as possible.

  • Submit the following files, one submission for each group: (1) Rmd file, (2) a compiled HTML or pdf version, and (3) all necessary data files if different from our source data. You may directly edit this .rmd file to add your answers. If you intend to work on the problems separately within your group, compile your answers into one Rmd file before submitting. We encourage you at least to attempt each problem by yourself before working with your teammates. Additionally, ensure that you can ‘knit’ or compile your Rmd file. It is also likely that you need to configure Rstudio to properly convert files to PDF. These instructions might be helpful.

  • In general, be as concise as possible while giving a fully complete answer to each question. All necessary data sets are available in this homework folder on Canvas. Make sure to document your code with comments (written on separate lines in a code chunk using a hashtag # before the comment) so others can follow along. R Markdown is particularly useful because it follows a ‘stream of consciousness’ approach: as you write code in a code chunk, make sure to explain what you are doing outside of the chunk.

  • Control your the output of each chunk using the echo=F, include=F, results='hide in the header of the chunk. You can set it globally (for all the chunks) using knitr::opts_chunk$set() in the first chunk above.

  • It is important to let your reader/audience know what you are plotting. Please label your ggplots clearly using ggtitle(), xlab(), ylab(), etc.

  • A few good or solicited submissions will be used as sample solutions. When those are released, make sure to compare your answers and understand the solutions.

1.3 Review materials

  • Study Basic R Tutorial
  • Study Advanced R Tutorial (dplyr and ggplot)
  • Study Module 1 EDA and Module 2 Spatial data

2 Case study 1: Citibike: weather affect

At the end of Module 1, we ask whether the weather can be an important factor to understand and predict bike usage. Let’s investigate how the weather affects Citibike usage.

2.1 Data acquisition

The first step is to acquire NYC weather in 2015. We have already scrapped the hourly weather data from Darksky API. The following code demonstrates how the data were scrapped and converted into a data frame. The final weather data is in NYC_weather_2015.csv.

Note: You do NOT need to run the following code chunk. By setting eval = FALSE in the chunk header, it is configured not to run when knitting the document.

# key = "obtain your key"
# darksky_api_key(force = TRUE)
# key 
# 
# unique_dates <- seq(as.Date("2015/01/01"), as.Date("2015/12/31"), "days")
# 
# weather_df <- unique_dates %>% 
#   map(~get_forecast_for(40.766048, -73.977320, .x)) %>% 
#   map_df("hourly") %>% 
#   mutate(loc = "Central Park",
#          date = date(time), 
#          lat =  as.numeric("40.766048"), 
#          long = as.numeric("-73.977320")) %>% 
# filter(time >= "2015-01-01 00:00:00") %>%
# select(time:icon, precipIntensity, temperature, humidity, windSpeed) 

# write.csv(weather_df, "NYC_weather_2015.csv")

2.2 Data preparation

2.2.1 Understand and clean the data

  1. Read data/NYC_weather_2015.csv into R.
# write your code here
# make sure to hide unnecessary outputs using results='hide'
# and to hide unnecessary outputs using echo=F

weather <- read.csv("data/NYC_weather_2015.csv")
  1. Set the variable natures properly, specifically convert time as POSIXct, summary and icon as factors.
weather <- weather %>% mutate(time = mdy_hm(time))

# mutate the variables
weather <- weather %>% mutate(summary = factor(summary))
weather <- weather %>% mutate(icon = factor(icon))

# check that it worked out
str(weather)
  1. Any missing values?
sum(is.na(weather))
## [1] 0

We do not have missing values.

  1. Do we have all the hourly weather? If not, which days are missing or which days have missing hours? (Hints: use month() and day() in lubridate package to get the month and day from time and then use unique() to first check if we have all 356 days. To check whether we have all 24 hours for every day, use group_by() and summarize() to calculate the number of observations by each day. Use filter() to see whether we have 24 observations for each day.)
library(lubridate)

month_data_counts <- weather %>% 
  group_by(month = month(time), day = day(time)) %>%
  summarise(count = n()) 

# Examine which have less or more than 24 hours a day
incorrect_data <- month_data_counts[month_data_counts$count != 24, ]
incorrect_data
## # A tibble: 2 × 3
## # Groups:   month [2]
##   month   day count
##   <dbl> <int> <int>
## 1     3     8    23
## 2    11     1    25

We are missing an hour from March 8th and have an additional hour for November 1st. Investigating the weather data for these days reveals March 8th is missing 2am and November 1st has an additional 1am, these irregularities are due to daylight savings.

To fix this take out the extra 1am on November 1st

weather <- weather[-7298,]

2.2.2 A quick look into the data

  1. How many types of weather (summary) are there in this data? (Hints: use unique().)
unique(weather$summary)
##  [1] Clear                      Partly Cloudy             
##  [3] Mostly Cloudy              Overcast                  
##  [5] Possible Flurries          Light Sleet               
##  [7] Light Rain                 Rain                      
##  [9] Drizzle                    Possible Drizzle          
## [11] Foggy                      Possible Light Sleet      
## [13] Possible Light Rain        Light Snow                
## [15] Possible Light Snow        Flurries                  
## [17] Humid and Overcast         Humid and Mostly Cloudy   
## [19] Humid                      Humid and Partly Cloudy   
## [21] Light Rain and Humid       Possible Drizzle and Humid
## [23] Drizzle and Humid          Rain and Humid            
## [25] Cloudy                    
## 25 Levels: Clear Cloudy Drizzle Drizzle and Humid Flurries Foggy ... Rain and Humid

There are 25 types of weather.

  1. The icon refers to the icon in the iOS weather app. What is the correspondence between icon and summary? (Hint: use group_by() and summarise().)
icon_summary_correlation <- weather %>% 
  group_by(icon = icon, summary = summary) %>% 
  summarize(count = n())

The summaries and the icons match fairly well between the weather data and the iOS weather app.

  1. Create a new variable weather by grouping some levels in icon together: “clear-night” and “clear-day” into “clear”, “partly-cloudy-night” and “partly-cloudy-day” into “cloudy”, i.e., weather has 6 categories/conditions: “clear”, “cloudy”, “snow”, “sleet”, “rain”, and “fog”. Remember to first convert icon into character so that we can add more levels.
weather$weather <- as.character(weather$icon)
weather$weather[weather$weather %in% c("clear-night", "clear-day")] <- "clear"

weather$weather[weather$weather %in% c("partly-cloudy-night", "partly-cloudy-day")] <- "cloudy"

str(weather)
  1. How many days are there for each weather condition in 2015?
# check results
table(weather$weather)
## 
##  clear cloudy    fog   rain  sleet   snow 
##   4680   3360    119    544      7     49

2.2.3 Merging Citi bike data with weather data

Next we need to merge the bike data data/citibike_2015.csv with the weather data by hours. Let’s first read in the bike data and convert the variables into appropriate formats.

# read in data
bike <- read.csv("data/citibike_2015.csv")

# change to correct data type
bike <- bike %>% mutate(usertype = factor(usertype), 
                        gender = factor(gender),
                        starttime_standard = ymd_hms(starttime_standard),
                        stoptime_standard = ymd_hms(stoptime_standard))

The following chunk creates a starttime_hour variable to get the starting hour for each trip. Use left_join() to join bike and weather data by hours.

# create a starttime hour with just the hour
bike <- bike %>% mutate(starttime_hour = floor_date(starttime_standard, unit = "hour"))

# join with the original data
bike <- bike %>%
  left_join(weather, by = c("starttime_hour" = "time"))

2.3 Weather effect

Now we are ready to investigate the relationship between weather condition and bike usage.

  1. Calculate the average hourly rentals by weather conditions and show a corresponding barplot (Hints: average hourly rentals = total number of trips/total number of hours by each weather condition.) Is there evidence that people are less likely to rent bikes during bad weather? Summarize your findings using less than 3 sentences.
# calculate the total number of trip by each weather condition
weather_n_trip <- bike %>%
  group_by(weather) %>%
  summarise(n_trip = n())

# calculate the total number of hours of each weather condition
weather_n <- weather %>%
  group_by(weather) %>%
  summarise(n_weather = n())

# merge the two
weather_n_trip <- left_join(weather_n_trip, weather_n, by = "weather")
# calculate the average hourly rentals by weather conditions
weather_n_trip <- weather_n_trip %>% mutate(avg_hourly_rental = n_trip / n_weather)


# use geom_bar(stat = "identity") to plot a barchat
ggplot(weather_n_trip, aes(x = weather, y = avg_hourly_rental)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  labs(title = "Average Hourly Bike Rentals in Different Weather Conditions", x = "Weather Condition", y = "Average Hourly Rentals")

The average hourly bike rentals is significantly reduced (by around 90%) during sleet and snow and reduced by around 50% under fog and rain. These findings support the claim that people are less likely to rent bikes during bad weather. Average hourly bike rentals are highest under clear and cloudy conditions.

  1. What about the trip duration under different weather conditions? Provide summary statistics and a boxplot to show whether there exist patterns in trip duration (duration_m) under different weather conditions. Briefly summarize your findings.
# Summary Stats
data_summary <- bike %>% 
  group_by(weather) %>% 
  summarise(across(c(duration_m), list(mean = mean, median = median, q25 = ~quantile(.,0.25), q75 = ~quantile(.,0.75))))

data_summary
## # A tibble: 6 × 5
##   weather duration_m_mean duration_m_median duration_m_q25 duration_m_q75
##   <chr>             <dbl>             <dbl>          <dbl>          <dbl>
## 1 clear             14.2              10.8            6.63           17.8
## 2 cloudy            13.7              10.4            6.48           17.2
## 3 fog               11.5               9.09           6.00           14.5
## 4 rain              11.3               8.8            5.75           14.2
## 5 sleet             29.3              25.0           13.9            40.4
## 6 snow               9.57              7.87           5.05           12.4
# boxplots
ggplot(bike) + geom_boxplot(aes("", duration_m)) + facet_wrap(~ weather) + labs(title = "Trip Durations Under Weather Conditions", x = "Weather Condition", y = "Trip Duration (minutes)")

These boxplots reveal that there are more outlying long trips on clear and cloudy days, as well as some on rainy days but that the average trip duration and quartiles stay fairly similar across all weather patterns except for sleet which seems to increase the trip duration average and quartiles by around 15 or so minutes. There are very few if any outlying long trips under sleet and snow and very few under fog. This seems to imply that there are only ‘joy rides’ under clear and cloudy conditions.

2.4 Trend by the hour of the day

  1. As we see in class, the two rush-hour periods account for most of the trips. And we have also observed that the weather condition affects the likelihood of renting bikes. How does the weather condition affect the likelihood of renting bikes, especially during rush hours? Show the average hourly rentals by hour of the day and by weather condition.
hourly_weather_trips <- bike %>% 
  group_by(weather, hour = hour(starttime_standard)) %>% 
  summarise(n_trip = n(), .groups = "drop")

ggplot(hourly_weather_trips, aes(x=hour, y=n_trip, color= weather, group=weather)) +
  geom_line(size = 1) +
  theme_minimal() +
  labs(title = "Number of Trips by Hour and Weather Condition", 
       x = "Hour of the Day",
       y = "Number of Trips")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

  1. We show in class that the usage patterns between weekdays and weekends vary a lot. Do people react to weather conditions differently between weekdays and weekends? Show the average hourly rentals by the hour of the day and by weather conditions between weekdays and weekends (using facet_wrap()) Briefly summarize your findings.
# add weekend info
bike %>% mutate(weekend_num = wday(starttime_standard)) %>% select(starttime_standard, weekend_num) %>% head(5)
##    starttime_standard weekend_num
## 1 2015-04-02 14:56:46           5
## 2 2015-04-26 11:31:58           1
## 3 2015-08-10 17:56:46           2
## 4 2015-08-07 20:24:07           6
## 5 2015-09-28 20:12:25           2
bike <- bike %>% mutate(weekend = ifelse(wday(starttime_standard) %in% 6:7,
                                         "Weekend", "Weekday"))

hourly_weather_trips_class <- bike %>% 
  group_by(weather, hour = hour(starttime_standard), weekend) %>% 
  summarise(n_trip = n(), .groups = "drop")


ggplot(hourly_weather_trips_class, aes(x=hour, y=n_trip, color= weather, group=weather)) +
  geom_line(size = 1) +
  theme_minimal() +
  facet_wrap(~weekend) +
  labs(title = "Number of Trips by Hour and Weather Condition", 
       x = "Hour of the Day",
       y = "Number of Trips")

People seem to respond similarly to weather conditions between weekdays and weekends. There are more trips during the weekdays but the relative amount of trips taken between weather conditions follows a similar pattern.

2.5 Temperature

  1. We observe that there are more bike trips during warmer months. Show the average hourly rentals by different temperatures. (Hint: use cut() function to bin temperature and then calculate the average hourly rentals by different temperature bins.)
# break into bins
 bike <- bike %>%
   mutate(temp_group = cut(temperature, breaks = seq(0, 110, 10))) 

# break into bins
weather <- weather %>%
   mutate(temp_group = cut(temperature, breaks = seq(0, 110, 10))) 

# group and summarize
temp_hourly_rentals <- bike %>% 
  group_by(temp_group) %>% 
  summarise(avg_hourly_rentals = n() / n_distinct(hour(starttime_hour)), .groups = "drop")

# graph
ggplot(temp_hourly_rentals, aes(x=temp_group, y=avg_hourly_rentals)) +
  geom_bar(stat = "identity", fill = "blue") +
  theme_minimal() +
  labs(title = "Number of Trips by Hour and Weather Condition", 
       x = "Hour of the Day",
       y = "Number of Trips") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

  1. Do people ride longer trips when the temperature is higher? Use a scatter plot to show the relationship between duration_m and temperature. You can further impose a regression line to support your argument using geom_smooth(method = lm).
# your code here

ggplot(bike, aes(x=temperature, y=duration_m)) +
  geom_point(alpha = 0.5, color = "blue") + 
  geom_smooth(method = "lm", color = "red", se = TRUE) +
  theme_minimal() +
  labs(title = "Trip Duration vs Temperature", x = "Temperature (F)", y = "Trip Duration (minutes)")

3 Case study 2: Citibike: proximity to subway stations

At the end of Module 1, we also ask whether proximity to public transportation can be an important factor to predict bike usage. Let’s investigate how the proximity to subway stations affects the total number of trips.

3.1 Data preparation

We obtain the geographical information (shapefiles) of subway stations from NYC Open Data. We use read_sf() to read the shapefile data with the WGS84 coordinate reference system.

subway <- read_sf("data/nyc_subway/DOITT_SUBWAY_STATION_04JAN2017.shp") %>%
  st_transform(crs = 4326)

Similar to what we did in class, let’s calculate the total number of trips by each station and convert it into an sf object.

trips_by_station <- bike %>%
  group_by(station = start.station.id) %>%
  summarise(lat = as.numeric(start.station.latitude[1]),
            long = as.numeric(start.station.longitude[1]),
            station = start.station.name[1],
            num_trip = n())

trips_by_station_sf <- st_as_sf(trips_by_station, 
                                coords = c("long", "lat"), 
                                crs = 4326) 

3.2 Visualisation using mapview()

Plot the subway stations and Citi bike stations using mapview(). Color the Citi bike stations by the total number of trips (num_trip) and color the subways stations red. Briefly summarize your findings.

mapview(trips_by_station_sf, zcol = "num_trip") + 
  mapview(subway, color = "red", col.regions = "red", cex = 1)

This graphic demonstrates that the bike stations largely follow along the subway lines which makes sense because those using the subway may need to bike further to their destination after taking the subway or may need to bike to the subway for further transportation.

3.3 Distance to the nearest station

  1. Calculate the distance to the closest subway stations for each bike station. (Hints: use st_nearest_feature() and st_distance().)
# get the index of the closest subway for each station
nearest_route <- st_nearest_feature(trips_by_station_sf, subway)
# calculate the distances to the closest subway for each station
distances <- st_distance(trips_by_station_sf, subway[nearest_route, ], 
                         by_element=TRUE)
trips_by_station_sf$dist_to_route <- distances
  1. Is there any evidence that if a bike station is closer to the subway station, more trips are starting from that station? Use a scatter plot to support your answer.
mapview(trips_by_station_sf, zcol = "dist_to_route") +
  mapview(subway, color = "red", col.regions = "red", cex = 1)

This graphic shows that there is a clear concentration of a large number of trips started at a station bike for stations that are closer to subway stations. This can be seen by the concentration of darker purple circles around the red subway dots.

3.4 Number of stations within 200 meters

Another proxy to measure the proximity to subway stations is the total number of stations within some distance. Calculate the number of subway stations within 200m for each bike station. (Hints: use st_buffer() to create a buffer for each station and then use st_join to join the buffered bike station with the subway stations.) Can we also conclude that if a bike station is close to more subway stations, there will be more trips starting from that station?

Note: Some subway stations have the same names but serve for different subway lines. For our HW, please just count them as separate stations as long as they have different OBJECTIDs.

# buffer and filter
stations_buffer_sf <- trips_by_station_sf %>% st_buffer(dist = 200) # buffer by about 200m
subway_station <- st_filter(subway, stations_buffer_sf)

# join
subways_near_bikes <- st_join(subway, stations_buffer_sf) %>% 
  group_by(OBJECTID, NAME) %>% 
  summarise(num_subway_in200 = n(), num_trip = sum(num_trip))

# plot on map
mapview(subways_near_bikes, zcol = "num_trip", cex = "num_subway_in200")
# plot on scatterplot
ggplot(subways_near_bikes, aes(x = num_subway_in200, y = num_trip)) +
  geom_point() +
  geom_smooth() +
  labs(title = "Number of Trips for Bike Stations and Number of Close Subway Stations", x = "Number of Close Subway Stations", y = "Number of Trips from Bike Station") +
  theme_minimal()
## Warning: Removed 328 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : pseudoinverse used at 0.98
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : neighborhood radius 1.02
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : reciprocal condition number 1.5962e-30
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : There are other near singularities as well. 1
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used at
## 0.98
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius
## 1.02
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal condition
## number 1.5962e-30
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : There are other near
## singularities as well. 1
## Warning: Removed 328 rows containing missing values or values outside the scale range
## (`geom_point()`).

# Plot boxplots
ggplot(subways_near_bikes) + 
  geom_boxplot(aes("", num_trip)) + 
  facet_wrap(~ num_subway_in200) + 
  labs(title = "Number of Trips for Bike Stations and Number of Close Subway Stations", x = "Number of Close Subway Stations", y = "Number of Trips from Bike Station") +
  theme_minimal()
## Warning: Removed 328 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

4 Discussion

In this homework, we explored how weather and the proximity to subway stations affect Citi bike usage. What other possible factors do you think may affect Citibike usage? Write down your plan to explore these factors, starting from data acquisition (using an official data source or conducting a survey), EDA to the final conclusion.

I would choose to investigate the proximity of these bike stations to businesses which could be classified as fast food, retail, restaurant, etc. which could be found by looking up the businesses registered in the area and their location which could likely be found from a variety of official sources. First I would examine the data for missing data and clean the data set, I would also look to see if the data is distributed normally in terms of number of businesses in each zone of nyc. Then I would look at how many businesses are around each station and if that coincides with the number of rides taken from each station. Then I would look at each individual business types and see if any of them have a stronger correlation with the number of rides taken compared to the overall number of businesses.

An additional area of data we would want to analyze is residential data on household income for those living in NYC. We could gather this information from the census and clean the data and do initial visualizations on the data distribution. Then we could see if there is a correlation between ride frequency and/or usage duration for those of different income brackets. This could help us see if different income brackets get more use out of citibike which could help us target new developments of the same income brackets or shifting neighborhood composition to target the income brackets that use citibike.